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Abstract 



Bayesian filtering is a general framework for recursively estimating the state of a dynamical 
system. Classical solutions such that Kalman filter and Particle filter are introduced in 
this report. Gaussian processes have been introduced as a non-parametric technique for 
system estimation from supervision learning. For the thesis project, we intend to propose 
a new, general methodology for inference and learning in non-linear state-space models 
probabilistically incorporating with the Gaussian process model estimation. 
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Chapter 1 
Introduction 



This report summarizes the work achieved in the first half of the entire thesis. 
Essentially, it includes the studying of a classic system model which has been widely 
used in many fields and its possible solutions within different scenarios. This work can 
be regarded as a preparation step. In the next half of the project, we will address a 
practical problem with the aid of these methodologies. 

In overview, chapter [2] introduces the classic state-space model and its generic solution 
Bayesian approach. Nevertheless, due to the integrals intractability in practice, 
chapter [3] describes Kalman filter for the linear state-space model while chapter [4] reveals 
the Particle filter methods for the more realistic non-linear model. 
All of these methods rely on the condition that the state-space model information is 
deterministic but in many cases, we deal with the situation with uncertain model 
structure. Rather than deciding the model relates to a specific model, chapter [5] includes 
the concept of a Gaussian process, the Gaussian process regression approach and 
supervision learning of the hyperparameters. In the end, materials that have been 
referred to are included in Bibliography and in the Appendix, MatLab codes for Kalman 
filter, Particle filter and Gaussian process regression are provided. 
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Chapter 2 
Bayesian Approach 



2.1 Introduction - State-space Model 

We consider probabilistic state-space models of the form 



where 

• /: state transition or evolution function 

• Xk,Xk-i- current and previous state 

• Uk-i- known input 

• Wk-i- state noise 

• h: measurement function 

• Zk- observation 

• Uk- known input 

• Vk- measurement noise 




(2.1) 



z k = hk(xk,u k ,v k ) 



(2.2) 
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Our aim is to provide a sequence of optimal (with respect to the minimum mean square 
error criterion (MMSE) estimates Xf. of a process. The true state is hidden and the infor- 
mation available upon which our estimate rely is a set of measurements (or observations) 
{zo-.k}- 

The state-space model is used in the fields of channel estimation in wireless communica- 
tions. For example the Autoregressive (AR) of first order is a well-accepted approximation 
of the Jake's channel update model [10] . Moreover, state-space model has been also widely 
used to predict economic data in finance, track positions in control system and recover 
image or speech in signal processing. 



2.2 Bayesian Approach 

With the fact that the evolution of the state follows a Markov Process of order one 



(Equation 2.1), a Bayesian approach solves the filtering problem in a sequential manner 
by incorporating all observations into account. This amounts to calulating the posterior 
distribution of the state p{xk\zo-,k) at each instant k. Assume that we have the access 
to the known previous state p(xk-i\zo-.k-i)- The idea of forming the required posterior 
of the next state is to combine the previous state information with p(xk\xk-i) from the 
state transition and p(z k \x k ). This prediction step is processed before the new observation 
coming. So as z k is obtained, we advance to the next step to update our prior estimate. 
In overall, the recursion proceeds in two stages, prediction and update as shown following. 

2.2.1 Bayesian Approach - Prediction 

The a prior estimate of the posterior distribution at k is given by 
p(xk\zi:k-i) = / p{x h ,x k -i\zi.. k -i)dx h - 1 



p(x k \x k -l,ZO:k-l)p(Xk-l\zi;k-l)dXk-l 

p{x k \x k - 1 )p(x k ^ 1 \z 1:tl ^ 1 )dx k ^ 1 

where we used the Markov property p(xk\xk-i, £o:fc-i) — p(%k\ x k-i) an d the prediction 
result is known as the Chapman-Kolmogorov equation pp. 



2.2.2 Bayesian Approach - Update 



By incorporating the new observation with the a prior estimate, we can update the pos- 
terior distribution as 

p{xkW.k) = p(xk\zk,z 0:k - 1 ) 

_ V\ X ki z ki z Q:k-l) 

p{Zk,Z 0: k-l) 
_ p(Xk, Zfc\zQ : f.-l) 

p{Zk\z ;k-l) 
_ p(z k \x k , ZQ ± _ 1 )p(x k \zQ ± ^ 1 ) 

p(zk\zo-.k-l) 
_ p(Zk\x k )p(Xk\zi:k-l) 
p(z k \zi :k -l) 

where p(zk\zi-k-i) = f p(z] C \xk)p(xk\zi-k-i)dxi c is the normalizing constant (evidence or 
marginal likelihood). 



2.3 Summary 

In theory, this Bayesian approach utilizes all the information available and it can provide 
a closed form solution to the problem. However in practice, intractable integrals and awk- 
ward equations may often occur and they are impossible to be evaluated analytically [2]. 
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Chapter 3 



Kalman Filter 



3.1 Introduction 

Kalman filter is an algorithm that produces a MMSE estimator of the state process 
recursively. It requires the assumptions such that [T] 

• Noises w and v are i.i.d. drawn from Gaussian distribution with known parameters 

• Evolution function / and update function h are both linear 

Thus, if the previous state p(xk-i\zo-.k-i) is Gaussian, then at the next time step p(x k \zo :k ) 



is Gaussian as well. So the state-space model equations 2.1 and 2.2 can be rewritten as 

x k = F k x k _ x + w k _i 
z k = H k x k + v k 

where F k and H k are known matrices defining the linear functions. In addition, we define 
w k _i has zero mean and covariance Q k -i', v k has zero mean and covariance R k . 



3.2 Kalman Filter - Algorithm 

Suppose that we have been up to one state k— 1 and we have the access to p(xk-i\zi : k-i) = 
A/"(mfc_i|fc_i, Pfc_i|fc_i), the recursive algorithm under the Bayesian framework consists of 
two steps, prediction and update. In this section, we will briefly introduce the Kalman 
filter algorithm. 
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3.2.1 Prediction 



Inserting the previous state into the evolution equation 2.1 we can find a prior distribution 
of the state x k as 

p{x k \zi: k -i) = N{m k \k-x,Pk\k-i) (3.1) 

where 

= Fkrrik-i\k-i 

Pk\k-l = Qk-l + FkPk-l\k-lF^ 

3.2.2 Update 

As we obtain the new observations z k , we are able to update the posterior distribution as 
follows. 

p(xk\zi-.k) = N(mk\k,Pk\k) (3.2) 

where 

m k \ k = m k \k-i + K k (z k - H k m k \ k „i)and 
P k \k — P k \k-i — K k H k P k \k-i 

and the Kalman gain is 

K k = P k \ k ^iHl \H k P k \ k _iH^ + R k )~ l 



3.3 Simulation 

Consider an AR{2) example as follows. 

x k = 2cos(2n f)x k _i - x fc _ 2 

z k = Xk + V k 
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where Vk is a Gaussian noise with zero mean and variance var_v. This problem can be 
rewritten in state-space form such that 







2cos{2nf) 


-1 




Xk-l 


Xk-1 




1 


Xk 







Xk-2 


Zk = 


1 




+ Vk 





Xk-1 

Using the Kalman filter algorithm, we obtain the simulation result in the following figure 
and the MatLab codes are included in the Appendix 1 and 2. 



Observations 
Estimated states 




Figure 3.1: Kalman filter for AR2 SSM 

3.4 Summary 

With the assumptions held, Kalman filter provides the optimal solution in this linear 
Gaussian environment. However when the assumptions of system linearity and Gaussian 
noise are not available, Kalman filter does not perform well. In the next chapter, we 
will describe an algorithm that performs superior for the non-linear state-space model 
problems. 
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Chapter 4 



Particle Filter 



4.1 Introduction 

For linear Gaussian state-space model, Kalman filter is served as an optimal recursive 
estimator under the Bayesian framework. However, what if the state-space model is not 
restricted as linear and Gaussian? Instead of Kalman filter and its approximation jl], we 
will introduce particle filtering methods to solve these estimation problems numerically 
in an online manner - recursively as observations become available. 

Particle filters perform sequential Monte Carlo (SMC) estimation based on point mass 
(or particles) representation of probability densities. Thus the key idea to resolve this 
state-space model probelm is to represent the state posterior density function by a set 
of random samples (also known as particles) with associated weights. As the number of 
samples approaches infinity, particle filter result approaches the optimal Bayesian solution. 



4.2 Monte Carlo Integration 

Monte Carlo integration is the basis of SMC methods. Suppose we want to numerically 
evaluate a multidimensional integral 

g(x)dx 
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Monte Carlo (MC) methods for numerical integration can be factorized g(x) = f(x)ir(x) 
in such a way that n(x) is interpreted as a probability density satisfying n(x) > and 
f ir(x)dx = 1. Drawing N ^ 1 samples x\ i — 1, . . . , N distributed according to n(x), 
the MC estimate of integral 

/ = J f{x)n(x)dx 



with the sample mean 



1 N 

i=i 



If the samples x l are independent then 1^ is an unbiased estimate and according to the 
law of large numbers In converges to the true value of /. 

Ideally we want to generate samples directly from ti(x) but in the context of filtering, 
ir(x) is the posterior whose samples we cannot obtain. Instead, we perform the sampling 
from a density q(x) named as the importance or proposal density. Following the principle 
of Importance Sampling, this proposal density is an approximation density to the true 
density ir(x). In this case, the integral of / can be rearranged as 



/ 



7T ( X ) 

f(x)n(x)dx= I f(x)——q(x)dx 



q(x) 

A Monte Carlo estimate of I is computed by generating independent samples distributed 
according to q(x) and forming the weighted sum 

1 N 

i=i 

where w{x l ) = are the importance weights and they can be normalized 

w ( x ) = N 

Then we estimate 1^ using the normalized importance weights to evaluate the integral 

N 

V 



1 N 



i=l 
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4.3 Sequential Importance Sampling 

The sequential importance sampling (SIS) is a Monte Carlo method upon which most 
sequential MC filters are relied on. This sequential Monte Carlo (SMC) approach is known 
variously as bootstrap filtering, the condensation algorithm, particle filtering, interacting 
particle approximations, and survival of the fittest. pQ Essentially it is a technique to 
implement a recursive Bayesian filter with the aid of Monte Carlo simulations. The 
principle is to represent the posterior density function by a summation of a set of random 
samples (particles) with associated weights and the tasks can be simplified to be finding 
the proper samples and their corresponding weights. By the law of large number, this 
approximation approaches to the real posterior density function as the number of samples 
becomes very large. In another word,the SIS filter becomes the optimal Bayesion estimator 
when N s approaches infinity. 

Before developing the details of the algorithm, we introduce {x l Q . k ,w l k } to be a random 
measure that characterizes the posterior pdf p(x .k\zi t k) where {x l 0:k , i = 1, . . . , N s } is a set 
of support points (particles) with associated weights {w k , i — 1, . . . , N s }. The weights are 
normalized such that Yli w l = 1- Then the posterior density at k can be approximated 

as 

p(x k \zi :k ) & ^wl5(x k -xl) 

i=l 

This is interpreted as the weighted approximation of the true posterior p{xQ-k\zi-k) . The 
normalized weights w l k are chosen based on the principle of Importance Sampling. There- 
fore, if the samples x l 0:k were drawn from an importance density q(x 0: k\zi : k) , then the 
weights become 



p(x\ 


\Zk) 


q(x\ 


\Zk) 



If the importance density can be factorized like this 

q(xo:k\zi;k) — q(xk\x ;k-i, Zi : k)q(x 0: k-i\zi;k-i) 

then we can obtain samples x l . k ~ ^(xo^l^i^) by augmenting each of the existing samples 
x o-.k~i ~ q( x 0:k-i\ z i-k-i) with the new state x\ ~ q [x k \ Xo-.k-i, Zx-k)- The full posterior 
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distribution can be rearranged as 

p(zk\x 0l k,Zi : k-l)p{xo:k\zi:k-l) 



P(x -.k\zi:k) 



p(Zh\Zv.k-l) 

_ P( z k\ x 0:ki z l:k-l)p(Xk\x 0: k-l, ^l:fc-l)p(^0:A:-l \Zl:k-l] 

p{z k \z v .k-x) 

_ P{Zk\Xk)p{Xh\Xk-l)p{XQ:k-l\Z\:k-l) 
P{Zk\Zl-. k -l) 

oc p(z k \x k )p(x k \x k -. 1 )p(x Q . Je - 1 \z 1 .. k - 1 ) 



and the weight update is 



P(z k \x k )p{x\\ 


• L k-l)P\ x 0:k-l 


\ z l:k-l) 


9(41 


X 0:fc-1> 


Zl:k)q(x % Q . k _ 1 \. 


z l:k-l) 



W k OC 

= nji p(^I4)p(4I4-i) 

q(x k \x l . k _ 1 , Z\ lk ) 

Furthermore, if q(x k \xo :k ^x, Zi :k ) = q{x k \x k -i, z k ) then the importance density appears 
only related to x k -i and z k . This turns out to be useful when only a filtered estimate of 
p{xk\zi-.k) (incomplete posterior) is required at each step. In this case, the weight update 
becomes 

i i p(*kl4M4l4-i) 

W k OC W k i . . i ■ r 

q{xl\xl_ 1)Zk ) 

and finally, the prediction of posterior filtered density is approximated as 

N s 

p{x k \z Uk ) « ^ w lK X k - X\ 



In summary, the SIS algorithm is formed by recursive propagation of importance weights 
w k and particles x\ as the sequential observation is obtained at each step. The algorithm 
is described in Algorithm [I] [9] 



4.4 Resampling 

Degeneracy Problem. A common problem associated with SIS particle filter is the de- 
generacy phenomenon where after a few recursive steps, all but one particle will have 
negligible weights. It implies that a large amount of computational effort is wasted in 
updating particles whose contribution is almost zero. A simple approach to resolve this 
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Algorithm 1 SIS Particle Filter 



• FOR % = 1 : N 

- Draw x k ~ g^x^, z k ) 

- Evaluate the importance weights 



wl = w l 



p( z *\K)p(K\K-i) 
A ~ l g(xj c |x[ );k _ 1 ,z 1:k ) 



END FOR 
FOR i = l: N 
Normalizing weight :w\ 



T N w j 

1^3 = 1 W k 

END FOR 



problem is to increase N but this will increase the computational cost which is unaccept- 
able in practice. Instead, we introduce the concept effective sample size N e ff which is 
evaluated as 

N eff = 1 (4.1) 

Ei=iK) 2 

Small N e fj indicates severe degeneracy so the approach is to perform the resampling when 
N e ff is below some threshold. The idea of resampling is to eliminate the low- weighted 
particles and to concentrate on particles with large weights. It involves a mapping of 
random measure {x l . k ,wl} into a random measure {x l . k * ,1/N} with uniform weights 
and an efficient resampling algorithm named systematic resampling is described in Algo- 
rithm^ [9] 

Other possible resampling algorithms can be referred to [5]. So far we have defined the 
main steps of a generic particle filter. The complete generic particle filter algorithm is 
summarized in Algorithm [3j |9j And the simulation result using this algorithm is shown 
in the following figure 
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Algorithm 2 Resampling 

[a^WjtJ = Resampling[x k , w k ] 

• Find cumulative sum (CS) of the weights 

• Start from the bottom of CS: i=l 

• Draw a starting point U\ ~ U[0, N' 1 } 

• FOR j = 1 : N s 

- Move along the CS: Uj — U\ + N^ 1 ^ — 1) 

- WHILE Uj > d 

- i + + 

- END WHILE 

- Assign sample x 3 k = Uj 

- Assign weight w k = N^ 1 

• END FOR 



Algorithm 3 SIS Particle Filter 
= ^[xk-iSwl-i^k] 

• Filtering via SIS [T] 



Calculate N e ff using 

IF N e ff j Nxhreshold 

Resampling 
END IF 



4.1 



15 




Figure 4.1: Particle filter with systematic resampling 

4.5 Simulation 

Consider a non-linear SSM example using the Particle filter algorithm with systematic 
resampling as follows. 

x k = ~7T~ + 2 ^ Xfc ~ 1 + 8cos(1.2(k - 1)) + w k _ x 
1 1 + x k-i 

where w and v are both Gaussian noise with zero mean and variance var_w, var_v respec- 
tively. In the simulation, we use 500 particles and track 50 time steps. 

4.6 Summary 

In this chapter we described the generic sequential importance sampling algorithm which 
serves as a basis for most particle filters. Compromising by the cost of high computational 
complexity, this generic particle filter can solve the non-linear state-space model with a 
good performance result. However, there are some special cases of SIS algorithms which 
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are derived by an appropriate choice of importance sampling density and/or modification 
of the resampling step. Possible special particle filters are p] 

• sampling importance resampling (SIR) filter 

• auxiliary sampling importance resampling (ASIR) filter 

• regularized particle filter (RPF) 

In accordance with the practical problem, we will select a suitable particle filter to be 
deployed. 
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Chapter 5 



Gaussian Process 



Kalman filter (Chapter [3]) and Particle filter (Chapter [4]) all rely on the condition that we 
have the deterministic state-space model structure. However in many cases, we deal with 
the problem that involves the SSM with uncertain structure. We are therefore required 
to jointly estimate the model structure as well as the state of the model. Rather than 
deciding that the unknown function relates to some specific models, a Gaussian process 
can represent the function flexibly, but rigorously, by letting the data decide the model 
structure. In this chapter, we will introduce how to use Gaussian processes for regression 
problems. 

5.1 Prediction Problem 

A typical prediction problem is that given some noisy observations of a dependent variable 
at certain values of the independent variable x, what the best estimate of the dependent 
variable at a new value OC ^ IS. This is modeled as 

Vn = f w {x ri )+N{tt,(J 2 n ) 

5.2 Bayesian Inference 

The Bayesian approach is used for inference based upon the expression of knowledge in 
terms of probability distributions. Given the data and a specific model, we can deter- 
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ministically make inferences using the rules of probability theory. Within the Bayesian 
approach to regression, we firstly infer the parameters w of the model given the data and 
then to make predictions based on the chosen models and parameters. 
We start by expressing prior beliefs about the model for the data in terms of a probability 
distribution over all possible function models, p(Mj). Then we express prior beliefs about 
the value of model parameters as p(w\Mi). 

Including the data x and y, we infer the parameters of the model given the data 

/ i A/r \ p(w\Mi)p(y\x,w,Mi) 

p(w\x,y,Mi) = — — — 

p{y\x,Mi) 

where p(w\x, y, Mi) is the posterior, p(y\x,w,Mi) is the likelihood, p(w\Mi) is the prior 
and p{y\x, Mi) is the evidence or marginal likelihood. 
Next, we combine the evidence 

p(y\x,Mi) = J p(w\Mi)p(y\x,w,Mi)dw 

with prior belief and apply Bayes' theorem once more to find the model probability 

p{y\x) 

where p(y\x) is the normalizing constant and this posterior distribution p(Mi\x,y) allows 
us to rank different models. 

Finally, we make the predictions of the future data relied on all of above equations. 
p{y*\x* ,x,y,Mi) = J p(y*\w,x* ,Mi)p(w\x,y,Mi)dw 

Despite Bayesian approach provides a uniquely optimal solution to the regression problem 
in theory, solutions may be difficult to find as in practice. The fundamental difficulty 
of Bayesian approaches centers around the mathematical complexity where intractable 
integrals and awkward equations may often occur [3]. 



5.3 Gaussian Processes 



Alternatively, Gaussian process techniques are introduced to formulate a Bayesion frame- 
work for regression [7] in a flexible and rigorous manner. Initially we start with the basic 
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multivariate Gaussian distribution (MVN) 

p(x\n, E) = N{fi, E) = (2tt)- d/2 |S|- 1/2 exp(-^(x - /i) T S~ 1 (x - fj)) 

where the mean vector fi e IR 13 and the covariance matrix E e M' DxD . As a generalization 
of of the MVN, a Gaussian process (GP) is extending the D dimensions into infinity which 
can be used to model a function which can be viewed as an aggregate for infinite quantity 
of random variables. The formal definition for a GP[8] is as follows. 

Definition 1. A Gaussian process is a collection of random variables, any finite number 
of which have a joint Gaussian distribution. 

A Gaussian process (GP) is fully characterized by its mean function m{x) and covari- 
ance function (also known as kernel) k(x,x') which are defined as 

m(x) = E[f(x)} 
k(x,x') = K[(f(x) — m(x))(f(x') — m(x'))] 

and we write the Gaussian process as 

f(x) ~ QV(m(x), k(x,x')) 

Note that the individual random variables in a vector from a Gaussian distribution 
are indexed by their positions in the vector instead of the time instants. For the Gaussian 
process it is the argument x of the random function f(x) that plays the role of index set: 
for every input x there is an associated random variable f(x), which is the value of the 
stochastic function / at that location. 

Although it seems unwieldy to work with an infinitely long mean vector and an infinite 
covariance matrix, it turns out that the quantities that we are interested in computing 
require only working with finite dimensional objects. For any GP / ~ QV(m, k) we only 
put attention on a finite subset of function values / = (f(xi),f(x2),---,f(x n )) which 
follows a regular Gaussian distribution such that 
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where \i = 0, S^- = k(xi,Xj). To clarify the distinction between process and distribution 
we use m and k to the former and \i and E for the latter. By using the properties of MVN 
we can make a prediction on y* based on the training pairs {x, y} and the test input x*. 

5.4 Posterior Gaussian Process 

In the previous section we saw how to define distributions over functions using GPs. This 
GP will be used as a prior for Bayesian inference. We are usually not primarily interested 
in drawing random functions from the prior, but want to incorporate the knowledge that 
the training data provides about the function. Let us start with the simple special case 
where no noise is added on the observation. The joint distribution of the training outputs 
/ and the test outputs /* according to the prior is 

K(X, X) K(X,X m ) 
K(X*,X) K{X*,X*)) 

where we have m for the training means and similarly m* for the test means. Also, we 
have K for training set covariances, for training-test set covariance and for test 
set covariance. 

Lemma 1. The formula for conditioning a joint Gaussian distribution is [7] 

P{x\y) ~ AT{a + CB-\y -b),A- CB- l C T ) 

Since we know the values for the training set / we can obtain the conditional distribution 
of /* given / as 

f*\f ~ N{m* + K(X.,X)K(X, X)-\f - m), 

K{X„X,) - K{X*,X)K(X, X^KiX, X,)) 

This is a prediction based on noise-free observations. In practice, it is more realistic 
modeling situations without the access to function values themselves. Instead, we only 
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obtain the noisy observations thereof y = f(x) + Af(0, a 2 ). 

Incorporating with the additive independent identically distributed (i.i.d.) Gaussian noise, 
we form a single kernel such that 

cov(y) = K(X,X) + a 2 n I 

Thus we modify the joint distribution of the observed target values and the function 
values at the test locations under the prior as 

K(X,X) + a 2 J K(X,X.) 
K(X*,X) K{X.,XJ) 

A Gaussian process posterior is 

f(x*)\x,y ~ QV(m post (x),k pos t(x,x')), where 

m post (x) = m* + k(x*,x) T (K(x, x) + c^i") _1 y, 
k P ost( x ,x') = k(x*,x*) — k(x*, x) T (K(x, x) + a 2 I)~ 1 k(x* , x) 

This leads us to the key predictive equations for Gaussian process regression 

y* \ x* iX^y ~ ft/(m(y*),cov(y*)), where (5.1) 
m(y*) = m* + K(X., X)(K(X, X) + a 2 n I)-\y - m), (5.2) 
cov{y*) = K(X*,X*) + a 2 n - K(X*,X)(K(X,X)+a 2 n I)- 1 K(X,X*) (5.3) 

Note that the variance is independent of the observed outputs y and it is the difference 
between the prior variance and a positive term, representing the information the obser- 
vation gives us about the function. 
Consider an example of the Gaussian process. 
Example. 

y = f(x)+X(0,a 2 n ) 

f~gv(o,k(x,x')) 

k(x, x') = exp( — (x — x') 2 ) 



y 


~A/"( 


m 








y* 
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Solution. Zero mean Gaussian process prior leads to the Gaussian predictive distribution: 

y*\x*,x,y ~ Af(k(x*,x) T (K(x,x) + a 2 n iy l y, (5.4) 
k(X*, X*) + a 2 n - k(x*, x) T {K(x, x) + a 2 J)~ l k{x\ x)) (5.5) 

A practical implementation of Gaussian process regression is shown in the figure and the 
MatLab code is in Appendix 



Gaussian Process Regression 




'-2 -1.5 -1 -0.5 0.5 1 1.5 2 



Figure 5.1: Gaussian process regression 

Instead of directly inverting the matrix, Cholesky decomposition of a matrix can be 
used since it is faster and numerically more stable. A good feature of GP is that it gives 
both the predictive mean (the blue curve) and 95% posterior confidence region (the grey 
shaded area). 



Note that in the result ( 5.4), the mean prediction is a linear combination of observa- 
tions y when the prior mean is zero. This is often referred to as a linear predictor [8] and 
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this mean equation can be re-written as 

n 

i=i 

where oti = (K + a 2 % I)~ l y and K is the compact form of K(x,x'). This can be seen as 
a linear combination of n kernel functions, each one centered on a training point. Intu- 
itively, although the GP defines a joint Gaussian distribution over all of the y variables, 
one for each point in the index set X, for making prediction at x* we only care about the 
(n + 1) dimensional distribution defined by the n training points and the test point. [8] 
This prediction can be given by conditioning this (n + l) dimensional distribution on the 
observations as shown above. 



5.5 Training a Gaussian Process 

Now a question left is which kernel function to choose and how to determine the hyper- 
parameters. In the light of training data, we need to find reliable prior information about 
the training data set with prior mean and covariance functions specified before making 
regression. However, the availability of such detailed prior information is not valid nor- 
mally. Referred as the training of GP, we need to form a mean and kernel function as the 
GP prior and in the light of observations, we calculate the appropriate hyperparameters 
within the function. 
Task 1. Form Covariance Function 

There are a set of well known covariance functions which are appropriate in different 
cases. [H] 

• Long-term smooth trend - Square Exponential 

ki(x, x) = Q\ exp(— (x — x') 2 /6l) 

• Seasonal trend - Quasi-periodic Smooth 

k 2 (x,x') = 9 2 3 exp( y -2sm 2 (n(x-x'))/9 2 5 ) x exp(-^(x - x'f/Of) 
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• Short- and medium-term anomaly - Rational Quadratic 

k 3 (x, x ') = el(i + { ^f)- es 

• Noise - Independent Gaussian and Dependent 

k 4 (x, x') = 9l exp(- — * - ) + Q\^ xx i 

By linearly combining them we obtain a comprehensive covariance function that utilizes 
the comparative advantages and compensates the drawbacks to large extent. 

k(x, x) = ki(x, x) + kz(x, x) + k%(x, x) + k±(x, x') 

Task 2. Find Hyperparameters 
For a Gaussian Process, 

f~GV{m,k) 

the mean and covariance functions are parameterized in terms of hyperparameters 9 = 
{9 m , 9k} where 9 m and 9^ indicate hyperparameters of mean and covariance functions 
respectively. In order to find the values for these hyperparameters, we compute the prob- 
ability of the data given the hyperparameters by introducing the log marginal likelihood 
(or evidence) since by assumption the distribution of the data is Gaussian: 

1 In 

L = logP(y\x, 9) = --(y - mf K~\y - m) - -log\K\ - -log(2ir) 



Then we can find the values of hyperparameters which optimizes the marginal likelihood 
based on its partial derivatives: 

dL = _ m jT K -i dm 
89 m 89 m 

dL 1. n7 , i 8K T ^ .1 . ,8K. 

_ = -(,- m ?K-i-K-\y - m) - frace^ -) 

The log marginal likelihood form consists of three terms: The first term — \(y—m) T K~ 1 (y— 
m) is a negative quadratic and plays the role of a data fit measure as it is the only term 
which depends on the training set output values y. The second term — \log\K\ is a com- 
plexity penalty term, which measures and penalizes the complexity of the model. The 
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third term is a log normalization term that is independent of the data. Note that the 
tradeoff between penalty and data fit - Occam's Razor - in the GP model is automatic. [7j 
There is no weighting parameter which needs to be set by external method and this feature 
has great practical importance since it simplifies training. 



5.6 Summary 

In this chapter we have introduced the basic concept of Gaussian process with its appli- 
cation on how to solve the regression problem with a GP flexibly as well as rigorously. 
Moreover, we illustrated multiple common-used kernel functions and the method deployed 
to resolve the hyperparameters associated. 
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Chapter 6 



Conclusion and Future Work 



Following the methodology-oriented research principle, fundamental knowledge of classical 
approaches to solve the state-space model with known structure are learnt. In thesis B, 
Gaussian process prior is to be incorporated with particle filter to solve some practical 
problem in wireless communications like channel estimation, which involves a non-linear 
state-space model with structure uncertainty. In analogy, after drawing a series of discrete 
points in the paper, we will find a proper line to connect those points to contribute to an 
agreeable outcome. 

Future work may include combining Gaussian process prior within state-space model to 
solve some practical problems in wireless communications. One possible problem is the 
channel tracking in relay networks [B] where the system model is illustrated in the following 
figure. If we assume the relay function is unknown, then this channel tracking problem 
involves a non-linear state-space model with parameters estimation. In this case, we need 
to incorporate Gaussian process for the function estimation with particle filter for the 
channel state information recovery. 



27 




Figure 6.1: Relay network system model 
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Appendix 1 - Kalman Filter for AR2 
model 



clear all; 
close all; 
clc 

% System Model 

% [x(n+l) ;x(n)] = [cos(2*pi*f) -1; 1 0] * [x(n) ;x(n-l)] ; 
% y(n) = [1 0]*[x(n);x(n-l)] + v(n) ; 



f = 0.05; 
theta=l ; 

F = [2*cos(2*pi*f ) -1;1 0] ; 
H = [10]; 



R = 0.1 
Q = 0.1 
N = 300 



% Measurement noise covariance 
% Process noise covariance 



x_state=zeros(2,N) ; % 

x_hat=zeros(2,N) ; % 

P=zeros(2,2,N) ; % 

x_priori=zeros(2,N) ; °/„ 

K=zeros(2,l,N) ; % 



the real state 
estimate state 

covariance error matrix N*(2,2) 
aprior estimate state 
Kalman gain 
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°/„ System model setup 
x_state ( : , 1)= [sin(theta) ; 0] ; 

x_state ( : , 2)= [sin(2*pi*f + theta) ; sin(theta)] ; 
for t=3:N 

x_state( : ,t)=F*x_state( : ,t-l) ; 

end 

for t=l:N 

v=normrnd(0, sqrt (R) ,1,1); 
y(t) = H*x_state( : ,t) + v; 

end 

°/ Initial guess 

x_hat_initial= [sin(theta) ; 0] ; % random initial state estimate 

P.initial = [1 0; 1] ; 

% First round of Kalman Filter 

[x_hat ( : , 1 ) , x_prior ( : , 1) , P(:,:,l), K(:,:,l)] = KalmanFilter (x_hat_initial , P_initial, 
for t=2:N 

[x_hat(:,t), x_prior( : ,t) , P(:,:,t), K(:,:,t)] = KalmanFilter (x_hat ( : ,t-l) , P(:,:,t- 

end 

t=l:N; 

figure 

plot (t , x_state , ' b ' , t , y , ' k . ' , t , x_hat , ' r ' ) 
grid on 
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Appendix 2 - Kalman Filter Function 



function [x, x_prior, P, K ] = KalmanFilter (x, P, z, F, H, Q, R) 

°/„ Projection 
x_prior = F*x; 
p=F*P*F ) +Q; 

°/„ Kalman gain 

K = P*H'*inv(H*P*H'+R) ; 

% Update estimate 

x = x_prior + K*(z-H*x_prior) ; 

% Update co variance 
dimension=size(K*H, 1) ; 
P= (eye (dimension) -K*H) *P ; 
end 
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Appendix 3 - Particle Filter 

°/„°/ Clean up 
clear all 
close all 
clc 

°/„°/ Set up problem parameters 

randn( ' state 5 , 1) °/ initialize Gaussian random number generator 
rand( 'twister ' , 1) % initialize uniform random number generator 
N = 500; °/ # of particles 
K = 50; % # of timesteps 
T = 0:K; % time vector 

°/ °/ Generate data 

vr_w = 0.1; °/ variance of Gaussian noise parameter w 
vr_v = 0.5; % variance of Gaussian noise parameter v 
xO = 0.1; % initial state value 
P0 = 0.1; % initial state variance 
x = xO; 

°/„ generate state and measurement vectors 
for i=2:K+l 

x(i) = x(i-l)/2 + 25*x(i-l)/(l+x(i-l)~2) + 8*cos (1 . 2* (i-1) ) + sqrt (vr_w) *randn; 
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end 

z = (x.~2)./20 + sqrt (vr_v) . *randn(size (x) ) ; 
°/„°/o Initialize particle filter 

°/„ The set of particles and their weights are denoted by j Xk j and j Wk j 
°/„ respectively, while j mn j is the mean of the particle distribution. It is 
% assumed that j xO j is known and we chose out initial state pdf to be a 
% Gaussian distribution about j xO j with the variance PO. 

Xk = xO + randn(l,N)*sqrt(PO) ; % initial particle population 

Wk = (l/sqrt(2*pi*P0))*exp(-(Xk-x0) . ~2/(2*P0)) ; % initial weight dist 

Wk = Wk/sum(Wk) ; % weight normalization 

mn = Xk*Wk ; ; % initial particle mean 

maxX = max(Xk) ; 

minX = min(Xk) ; 

°/o°/o Run particle filter 

for t=2:K+l 

"/Propagate particles 

Xk = Xk./2 + 25*Xk./(l+Xk."2) + 8*cos (1 . 2* (t-1) ) + sqrt (vr_w) *randn; 

"/Update weights 
"/posterior pdf 

Wk = Wk.*((l/sqrt(2*pi*vr_v))*exp(-(z(t)-(Xk.-2) 720) . ~2/ (2*vr_v) ) ) ; 
Wk = Wk/sum(Wk) ; 

°/ Infer particle mean (aggregate state estimate) 
maxX(t) = max(Xk); 
minX(t) = min(Xk) ; 
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mn(t) = Xk*Wk' ; 



°/ Multinomial resampling 
n_thr = 0.25*N; 
n.eff = l/(sum(Wk."2)) ; 
if n_eff<n_thr 

cs = cumsum(Wk) ; °/ generate cumulative sum 

% vector for the weights (CSW) 

for i=l:N 

indx = min(find(cs > rand)); % find CSW index for which the 

°/ CSW just exceeds the random number 

Xk(i) = Xk(indx) ; °/„ replicate the corresponding 

°/ particle in the new population 

end 

Wk = ones (size (Wk))/N; % assign uniform weights to 
°/„ resampled particles 



end 

end 

plot (x) 
hold on 
plot(mn, 'g') 
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Appendix 4 - Gaussian Process 
Regression 



% Posterior prediction 

u 

clear all; 
close all; 
clc 

°/„°/ Training data 
var_n=0.1; % noise variance 
var=l; % kernel hyperparameter 

1=0.5; °/ kernel hyperparameter 

training_x= [-1 : . 2 : 1] ; 

number_data=length(training_x) ; °/ number of training data 

K=se_cov(training_x,training_x,var,l) ; % covariance matrix 
mean_y=zeros (number_data, 1) ; 

training_y=mvnrnd(mean_y ,K) ; % y~N(0,K) 

training_y=training_y ' +sqrt (var_n) *randn(number_data, 1) ; % y=f (x)+noise 

%% Predict test data 

test_x=[-2:0.001:2] ; % test input 

mean_test_y=zeros (1 , length(test_x) ) ; 

for i=l : length(test_x) 

mean_test_y (i)=se_cov(test_x(i) ,training_x, var , 1) ' *inv(K+var_n*eye (number_data) ) *tra 
var_test_y (i)=se_cov(test_x(i) ,test_x(i) ,var,l)-se_cov(test_x(i) ,training_x,var,l) '* 
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end 

°a piot 

plot(test_x,mean_test_y, 'r' ,training_x,training_y, 'obO 
cf _upper=mean_test_y+2*sqrt (var_test_y) ; 
cf _lower=mean_test_y-2*sqrt (var_test_y) ; 
f = [cf_upper; f lipdim(cf _lower , 1)] ; 

fill([test_x; f lipdim(test_x, 1)] , f, [7 7 7]/8, 'EdgeColor', [7 7 7] /8) 
hold on 

plot (test_x ,mean_test_y , training_x , training_y , ' ob ' ) 
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Appendix 5 - Covariance Function 



°/„ Calculate covariance funciton 
function K = se_cov(x, y,var,l); 
K=zeros (length(x) ,length(y)) ; 
for i=l : length(x) 

for j=l : length(y) 

K(i,j)=var*exp(-0.5/l*(x(i)-y(j))-2) ; 

end 

end 
K=K' ; 
end 
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